Abstract. Searching for all occurrences of a pattern in a text is a fundamental
problem in computer science with applications in many other fields, such as
natural language processing, information retrieval and computational biology.
In the last two decades a general trend has emerged that tries to exploit the
power of the word RAM model to speed up classical string matching algorithms.
In this model an algorithm operates on words of length w, grouping blocks of
characters, and arithmetic and logic operations on the words take one unit of
time. In this paper we use specialized word-size packed string matching
instructions, based on the Intel streaming SIMD extensions (SSE) technology, to
design very fast string matching algorithms for short patterns. Our
experimental results show that, despite their quadratic worst-case time
complexity, the newly presented algorithms are the clear winners on average for
short patterns when compared against the most effective algorithms known in
the literature.



1 Introduction 

Given a text t of length n and a pattern p of length m over some alphabet
$\Sigma$ of size $\sigma$, the exact string matching problem consists in
finding all occurrences of the pattern p in t. This problem has been
extensively studied in computer science because of its direct application to
many areas. Moreover, string matching algorithms are basic components in many
software applications and play an important role in theoretical computer
science by providing challenging problems.

In a computational model where the matching algorithm is restricted to read
all the characters of the text one by one, the optimal complexity is $O(n)$,
and was achieved for the first time by the well-known Knuth-Morris-Pratt
algorithm [20] (KMP). However, in many practical cases it is possible to avoid
reading all the characters of the text, achieving sub-linear performance on
average. The optimal average $O(\frac{n \log_\sigma m}{m})$ time complexity
[28] was reached for the first time by the Backward-DAWG-Matching algorithm
[7] (BDM). However, most of the algorithms with a sub-linear average behavior
may have to read all the text characters in the worst case. It is interesting
to note that many of those algorithms have an even worse $O(nm)$ time
complexity in the worst case [6].

In the last two decades a lot of work has been done to exploit the power of
the word RAM model of computation to speed up classical string matching
algorithms. In this model, the computer operates on words of length w, thus
blocks of characters are read and processed at once. This means that the usual
arithmetic and logic operations on the words all take one unit of time.

Most of the solutions which exploit the word RAM model are based on the 
bit-parallelism technique or on the packed string matching technique. 

The bit-parallelism technique [1] takes advantage of the intrinsic parallelism
of the bit operations inside a computer word, cutting down the number of
operations that an algorithm performs by a factor of up to w. Bit-parallelism
is particularly suitable for the efficient simulation of nondeterministic
automata. The first algorithm based on it, named Shift-Or [1] (SO), efficiently
simulates the nondeterministic version of the KMP automaton and runs in
$O(n \lceil m/w \rceil)$ time; it is still considered among the best practical
algorithms in the case of very short patterns and small alphabets [16, 14].
Later a very fast BDM-like algorithm (BNDM), based on the bit-parallel
simulation of the nondeterministic suffix automaton, was presented in [24].
Some variants of the BNDM algorithm [11, 13, 8, 25] are among the most
efficient practical solutions in the literature (see [16, 14]). However, the
bit-parallel encoding requires one bit per pattern symbol, for a total of
$\lceil m/w \rceil$ computer words. Thus, as long as a pattern fits in a
computer word, bit-parallel algorithms are extremely fast; otherwise their
performance degrades considerably as $\lceil m/w \rceil$ grows. Though there
are a few techniques to maintain good performance in the case of long patterns
[22, 9, 5], this limitation is intrinsic.
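
As a concrete illustration of bit-parallelism, the following is a minimal C
sketch of a Shift-Or style search for patterns of at most 64 characters. The
function name shift_or_search, the 64-bit state word and the output-buffer
interface are assumptions made for this example; they are not taken from [1].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift-Or search for patterns of length 1..64 (one bit per pattern symbol).
 * Stores up to max_out starting positions in out; returns the total count. */
size_t shift_or_search(const unsigned char *t, size_t n,
                       const unsigned char *p, size_t m,
                       size_t *out, size_t max_out)
{
    uint64_t masks[256];
    uint64_t state = ~(uint64_t)0;      /* all ones: no prefix matched yet */
    size_t count = 0;

    assert(m >= 1 && m <= 64);
    memset(masks, 0xFF, sizeof masks);  /* bit j stays 1 unless p[j] == c */
    for (size_t j = 0; j < m; j++)
        masks[p[j]] &= ~((uint64_t)1 << j);

    for (size_t i = 0; i < n; i++) {
        /* bit j of state is 0 iff p[0..j] matches the text ending at i */
        state = (state << 1) | masks[t[i]];
        if ((state & ((uint64_t)1 << (m - 1))) == 0) {
            if (count < max_out)
                out[count] = i + 1 - m;
            count++;
        }
    }
    return count;
}
```

Each text character costs one shift and one OR, independently of m, which is
why the approach is attractive as long as the pattern fits in one word.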

In the packed string matching technique multiple characters are packed into
one larger word, so that the characters can be compared in bulk rather than
individually. In this context, if the characters of a string are drawn from an
alphabet of size $\sigma$, then $\lfloor w / \log \sigma \rfloor$ different
characters fit in a single word, using $\lceil \log \sigma \rceil$ bits per
character. The packing factor is $\alpha = w / \log \sigma$.

A first theoretical result in packed string matching was proposed by
Fredriksson [17]. He presented a general scheme that can be applied to speed
up many pattern matching algorithms. His approach relies on the use of the
Four Russians technique (i.e. tabulation), achieving in favorable cases an
$O(n^{\varepsilon} m)$-space and
$O(\frac{n}{\log_\sigma n} + n^{\varepsilon} m + occ)$-time complexity, where
$\varepsilon > 0$ denotes an arbitrarily small constant and occ denotes the
number of occurrences of p in t. Bille [4] presented an alternative solution
with $O(\frac{n}{\log_\sigma n} + m + occ)$-time and
$O(n^{\varepsilon} + m)$-space complexities, obtained by an efficient
segmentation and coding of the KMP automaton. Recently Belazzougui [2]
proposed a packed string matching algorithm which works in
$O(\frac{n}{m} + \frac{n}{\alpha} + m + occ)$ time and $O(m)$ space, reaching
the optimal $O(\frac{n}{\alpha} + occ)$-time bound for
$\alpha \le m \le \frac{n}{\alpha}$. However, none of these results is of
practical interest.

The first algorithm that achieves good practical and theoretical results was
very recently proposed by Ben-Kiki et al. [3]. The algorithm is based on two
specialized packed string instructions, the pcmpestrm and the pcmpestri
instructions [19], and reaches the optimal $O(\frac{n}{\alpha} + occ)$-time
complexity requiring only $O(1)$ extra space. Moreover, the authors showed
that their algorithm is among the fastest string matching solutions in the
case of very short patterns. However, it should be noted that on current
generation Intel Sandy Bridge processors, pcmpestrm and pcmpestri have 2-cycle
throughput and 7- and 8-cycle latency, respectively [19].

When the length of the searched pattern increases, another algorithm named
Streaming SIMD Extensions Filter (SSEF), presented by Külekci in [21] (and
extended to multiple pattern matching in [10]), exploits the advantages of the
word RAM model. Specifically, it uses a filter method that inspects blocks of
characters instead of reading them one by one. Despite its $O(nm)$ worst-case
time complexity, the SSEF algorithm is among the fastest solutions when
searching for long patterns [16, 14]. Efficient solutions have also been
designed for searching on packed DNA sequences [26, 12]. However, in this paper
we do not take this type of solution into account, since it requires a
different type of data representation.

Streaming SIMD technology offers single instructions that perform a variety of
tests on packed strings. Unfortunately, those instructions are heavier than
other instructions provided in the same family, as a consequence of their
relatively high latencies. Hence, in this paper we focus on the design of
algorithms using instructions with low latency and throughput when compared
with those used in [3]. Specifically, we present a new practical and efficient
algorithm for the exact packed string matching problem that turns out to be
faster than the best algorithms known in the literature in the case of short
patterns. The algorithm, named Exact Packed String Matching (EPSM), is based on
three different search procedures used for, respectively, very short patterns
($0 < m < \alpha/2$), short patterns ($\alpha/2 \le m < \alpha$) and medium
length patterns ($m \ge \alpha$). They use specialized packed string
instructions with low latency and throughput compared with those used in [3].
All search procedures have an $O(nm)$ worst-case time complexity. However,
they perform very well on average. In the case of very short patterns, i.e.
when $m \le \alpha/2$, the first two search procedures achieve, respectively,
an $O(n + occ)$ and an optimal $O(\frac{n}{\alpha} + occ)$ time complexity.

The paper is organized as follows. In Section 2, we introduce some notions and
terminology. We then present a new algorithm for the packed string matching
problem in Section 3 and report experimental results on short patterns in
Section 4. Conclusions are given in Section 5.

2 Notions and Terminology 

Throughout the paper we will make use of the following notations and
terminology. A string p of length $m > 0$ is represented as a finite array
$p[0 \,..\, m-1]$ of characters from a finite alphabet $\Sigma$ of size
$\sigma$. Thus $p[i]$ will denote the $(i+1)$-st character of p, for
$0 \le i < m$, and $p[i \,..\, j]$ will denote the factor (or substring) of p
contained between the $(i+1)$-st and the $(j+1)$-st characters of p, for
$0 \le i \le j < m$. In some cases we will denote by $p_i$ the $(i+1)$-st
character of p, so that $p_i = p[i]$ and $p = p_0 p_1 \ldots p_{m-1}$.

We indicate with the symbol w the number of bits in a computer word and with
the symbol $\gamma = \lceil \log \sigma \rceil$ the number of bits used for
encoding a single character of the alphabet $\Sigma$. The number of characters
of the alphabet that fit in a single word is denoted by
$\alpha = \lfloor w/\gamma \rfloor$. Without loss of generality we will assume
throughout the paper that $\gamma$ divides w and that $\alpha$ is an even
value.

In chunks of $\alpha$ characters, the string p is represented by an array
$P[0 \,..\, k-1]$ of length $k = \lfloor (m-1)/\alpha \rfloor + 1$. In
particular we denote $P = P_0 P_1 P_2 \ldots P_{k-1}$, where
$P_i = p_{i\alpha} p_{i\alpha+1} p_{i\alpha+2} \ldots p_{i\alpha+\alpha-1}$,
for $0 \le i < k$. The last block $P_{k-1}$ is not complete if
$m \bmod \alpha \ne 0$. In that case, the rightmost remaining characters of the
block are set to zero.

Although different values of $\alpha$ and $\gamma$ are possible, in most cases
we assume that $\alpha = 16$ and $\gamma = 8$, which is the most common setting
when working with characters in ASCII code and in a word RAM model with 128-bit
registers; such registers are available in almost all recent commodity
processors supporting single instruction multiple data (SIMD) operations.

Finally, we recall the notation of some bitwise infix operators on computer
words, namely the bitwise and "&", the bitwise or "|" and the left shift
"$\ll$" operator (which shifts to the left its first argument by a number of
bits equal to its second argument).



3 A New Packed String Matching Algorithm 

In this section we present a new packed string matching algorithm, named Exact
Packed String Matching algorithm (EPSM), which turns out to be efficient in
the case of short patterns. EPSM is based on three different auxiliary
algorithms, which we name EPSMa, EPSMb and EPSMc, respectively.

The first two auxiliary algorithms are designed to search for patterns of
length at most $\alpha/2$. When the pattern is longer than $\alpha/2$ the
algorithms adopt a filter mechanism: they first search for a substring of the
pattern of length $\alpha/2$ and, when a candidate occurrence has been found, a
naive check follows. The third algorithm adopts a filtering-based solution.

All three algorithms run in $O(nm)$ worst-case time and use, respectively,
$O(\min\{m, \alpha\})$, $O(1)$ and $O(2^k)$ additional space, where k is a
constant parameter. However, when $m \le \alpha/2$ the EPSMa and EPSMb
algorithms reach, respectively, an
$O(m\alpha + \frac{nm}{\alpha} + occ)$ and an $O(\frac{n}{\alpha} + occ)$ time
complexity. The first search procedure is designed to be extremely fast in the
case of very short patterns, i.e. when $m < \alpha/2$; the second algorithm
turns out to be a good choice when $\alpha/2 \le m < \alpha$, while the third
algorithm turns out to be effective when $m \ge \alpha$. In practice we tuned
the EPSM algorithm to run EPSMa when $0 < m < 4$, EPSMb when $4 \le m < 16$,
and EPSMc in all other cases.

In what follows, we first describe in Section 3.1 the computational model we
use for the description of our solutions. Then we present the three auxiliary
algorithms independently: EPSMa in Section 3.2, EPSMb in Section 3.3, and EPSMc
in Section 3.4.

3.1 The Model



In the design of our algorithms we use specialized word-size packed string
matching instructions, based on the Intel streaming SIMD extensions (SSE)
technology. SIMD instructions exist in many recent microprocessors, which
support the parallel execution of some operations on multiple data
simultaneously via a set of special instructions working on a limited number of
special registers. Although the usage of SIMD has been explored deeply in
multimedia processing, in the implementation of encryption/decryption
algorithms, and in some scientific calculations, it has not been much
addressed in pattern matching.

In our model of computation we suppose that w is the number of bits in a word
and $\sigma$ is the size of the alphabet. We indicate with the symbol
$\alpha = w / \log \sigma$ the number of characters which fit in a single
computer word.

In most practical applications we have $\sigma = 256$ (ASCII code). Moreover,
SSE specialized instructions operate on 128-bit registers, thus reading and
processing blocks of sixteen 8-bit characters in a single time unit (thus
$\alpha = 16$).

In the design of our algorithms we make use of the following specialized 
word-size packed instructions. For each instruction we describe how it could be 
emulated by using SSE specialized intrinsics. 

wscmp(a, b) (word-size compare instruction) 

Compares two w-bit words, handled as blocks of $\alpha$ characters. In
particular, if $a = a_0 a_1 \ldots a_{\alpha-1}$ and
$b = b_0 b_1 \ldots b_{\alpha-1}$ are the two w-bit integer parameters,
wscmp(a, b) returns an $\alpha$-bit value $r = r_0 r_1 \ldots r_{\alpha-1}$,
where $r_i = 1$ if and only if $a_i = b_i$, and $r_i = 0$ otherwise. Below we
give an example of the application of wscmp(a, b), assuming $w = 48$,
$\gamma = 4$ and $\alpha = 12$.



      0    1    2    3    4    5    6    7    8    9    10   11
a:    0110.0010.0111.1010.0010.1110.0010.0100.0110.0111.0100.0010
b:    0100.0010.0000.0111.1111.0010.0010.1100.0110.0100.1110.0010
r:    0    1    0    0    0    0    1    0    1    0    0    1
The wscmp specialized instruction can be emulated in constant time by using
the following sequence of specialized SIMD instructions:

    h ← _mm_cmpeq_epi8(a, b)
    r ← _mm_movemask_epi8(h)

Specifically, the _mm_cmpeq_epi8 instruction compares two 128-bit words,
handled as blocks of sixteen 8-bit values, and returns a 128-bit value
$h = h_0 h_1 \ldots h_{15}$, where $h_i = 1^8$ if and only if $a_i = b_i$, and
$h_i = 0^8$ otherwise. It has a 0.5-cycle throughput and a 1-cycle latency.
The _mm_movemask_epi8 instruction gets a 128-bit parameter h, handled as
sixteen 8-bit integers, creates a 16-bit mask from the most significant bits
of the 16 integers in h, and zero-extends the upper bits.
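
The two-instruction emulation above can be sketched in C as follows, for the
case $\alpha = 16$ and $\gamma = 8$. The function name wscmp16 and the
unaligned loads are assumptions of this sketch; a scalar fallback is included
for targets where SSE2 is not available. Bit i of the result (counting from
the least significant bit) corresponds to $r_i$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* wscmp for alpha = 16, gamma = 8: bit i of the result is 1 iff a[i] == b[i]. */
uint16_t wscmp16(const unsigned char *a, const unsigned char *b)
{
    assert(a != NULL && b != NULL);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i h = _mm_cmpeq_epi8(va, vb);     /* 0xFF in every equal byte */
    return (uint16_t)_mm_movemask_epi8(h);  /* top bit of each byte */
#else
    uint16_t r = 0;                         /* portable scalar fallback */
    for (int i = 0; i < 16; i++)
        if (a[i] == b[i])
            r |= (uint16_t)(1u << i);
    return r;
#endif
}
```

Comparing a text block against sixteen copies of one character, as EPSMa
does, then yields one bit per position where that character occurs.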
wsmatch(a, b) (word-size pattern matching instruction)

Reports all occurrences of a short string b in a w-bit parameter a, handled as
a string of $\alpha$ characters. The parameter b is a string of length
$k < \alpha$. Specifically, if $a = a_0 a_1 \ldots a_{\alpha-1}$ and
$b = b_0 b_1 \ldots b_{k-1}$, then the wsmatch(a, b) instruction returns an
$\alpha$-bit integer value $r = r_0 r_1 \ldots r_{\alpha-1}$, where $r_i = 1$
if and only if $a_{i+j} = b_j$ for $j = 0 \ldots k-1$, i.e. an occurrence of b
in a begins at position i. Notice that $r_i = 0$ for $\alpha - k < i < \alpha$,
since no occurrence of b in a could begin at a position greater than
$\alpha - k$. Below we give an example of the application of wsmatch(a, b),
assuming $w = 48$, $\gamma = 4$, $\alpha = 12$ and $k = 3$.
      0    1    2    3    4    5    6    7    8    9    10   11
a:    0110.1010.0111.1010.0100.1010.0111.1010.0000.1010.0100.0010
b:    1010.0111.1010
r:    0    1    0    0    0    1    0    0    0    0    0    0

The wsmatch(a, b) instruction can be emulated in constant time by using the
following sequence of specialized SIMD instructions:

    h ← _mm_mpsadbw_epu8(a, b)
    h' ← _mm_cmpeq_epi8(h, z)
    r ← _mm_movemask_epi8(h')

where z is a 128-bit register with all bits set to 0, i.e. $z = 0^{128}$.

Specifically, the _mm_mpsadbw_epu8(a, b) instruction gets two 128-bit words,
handled as blocks of sixteen 8-bit values, and returns a 128-bit value
$r = r_0 r_1 \ldots r_7$, where each $r_i$ is a 16-bit value computed as
$r_i = \sum_{j=0}^{3} |a_{i+j} - b_j|$, for $i = 0 \ldots 7$. Thus we have that
$r_i = 0^{16}$ if and only if $a_{i+j} = b_j$ for $j = 0 \ldots 3$, i.e. an
occurrence of the prefix of b with length 4 begins in a at position i. The
_mm_mpsadbw_epu8 instruction has a 1-cycle throughput and a 4-cycle latency.
The _mm_cmpeq_epi8 and _mm_movemask_epi8 instructions have been described
above.
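
To make the sum-of-absolute-differences test concrete, here is a portable
scalar model in C of what the sequence above computes for a 4-character
prefix. It is a sketch of the semantics only, not the SIMD code itself: the
name wsmatch4 and the compact 8-bit result (one bit per starting offset) are
assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the SAD-based match for alpha = 16 and a 4-byte prefix:
 * bit i (0 <= i < 8) is set iff sum over j = 0..3 of |a[i+j] - b[j]| is 0. */
uint8_t wsmatch4(const unsigned char *a, const unsigned char *b)
{
    uint8_t r = 0;
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 8; i++) {
        unsigned sad = 0;
        for (int j = 0; j < 4; j++)
            sad += (unsigned)abs((int)a[i + j] - (int)b[j]);
        if (sad == 0)               /* zero SAD means all four bytes agree */
            r |= (uint8_t)(1u << i);
    }
    return r;
}
```

Only the first eight offsets of the block are covered, which is why EPSMb
needs a second, blended block for occurrences starting in the second half.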



wsblend(a, b) (word-size blend instruction)

Blends two w-bit parameters, handled as two blocks of $\alpha$ characters.
Specifically, if $a = a_0 a_1 \ldots a_{\alpha-1}$ and
$b = b_0 b_1 \ldots b_{\alpha-1}$, the instruction returns a w-bit integer
$r = r_0 r_1 \ldots r_{\alpha-1}$, where $r_i = a_{i+\alpha/2}$ if
$0 \le i < \alpha/2$, and $r_i = b_{i-\alpha/2}$ if $\alpha/2 \le i < \alpha$,
i.e. $r = a_{\alpha/2} a_{\alpha/2+1} \ldots a_{\alpha-1} b_0 b_1 \ldots b_{\alpha/2-1}$.
Below we give an example of the application of wsblend(a, b), assuming
$w = 48$, $\gamma = 4$ and $\alpha = 12$.
      0    1    2    3    4    5    6    7    8    9    10   11
a:    0110.0010.0111.1010.0010.1110.0010.0100.0110.0111.0100.0010
b:    0100.0010.0000.0111.1111.0010.0010.1100.0110.0100.1110.0010
r:    0010.0100.0110.0111.0100.0010.0100.0010.0000.0111.1111.0010



The wsblend(a, b) instruction can be emulated in constant time by using the
following sequence of specialized SIMD instructions:

    h ← _mm_blend_epi16(a, b, c)
    r ← _mm_shuffle_epi32(h, _MM_SHUFFLE(1, 0, 3, 2))

The _mm_blend_epi16 instruction blends two 128-bit integers,
$a = a_0 a_1 \ldots a_7$ and $b = b_0 b_1 \ldots b_7$, handled as packed 16-bit
integers, according to a third parameter c. In particular it returns a 128-bit
integer $r = r_0 r_1 \ldots r_7$, where $r_i = a_i$ if $c_i = 0$, and
$r_i = b_i$ otherwise. If we set $c = 0^4 1^4$ we get
$r = a_0 a_1 a_2 a_3 b_4 b_5 b_6 b_7$. The _mm_blend_epi16 instruction has a
0.5-cycle throughput and a 1-cycle latency.
The _mm_shuffle_epi32 instruction shuffles a w-bit parameter,
$a = a_0 a_1 a_2 a_3$, handled as four 32-bit values, according to the order
given by the _MM_SHUFFLE macro. In this case we get $r = a_2 a_3 a_0 a_1$. The
_mm_shuffle_epi32 instruction has a 1-cycle throughput and a 1-cycle latency.

wscrc(a) (word-size cyclic redundancy check instruction) 

computes the 32-bit cyclic redundancy checksum (CRC) signature for a w-bit 
parameter. It is an error-detecting code commonly used in digital networks and 
storage devices to detect accidental changes to raw data and can also be used 
as a hash function. 

The wscrc(a) instruction can be emulated in constant time by using the
_mm_crc32_u64(a) specialized SIMD instruction, which computes the 32-bit
cyclic redundancy check of a 64-bit block according to a polynomial. This
instruction has a 1-cycle throughput and a 3-cycle latency, and thus provides a
robust and fast way of computing hash values.

Additional specialized instructions 

In addition to the instructions listed above, given an $\alpha$-bit register r,
in our description we use the symbol $\{r\}$ to indicate the set of bits in r
whose value is set. More formally, given an $\alpha$-bit register
$r = r_0 r_1 r_2 \ldots r_{\alpha-1}$, we have
$\{r\} = \{i \mid 0 \le i < \alpha \text{ and } r_i = 1\}$. Moreover, given a
value $s \in \mathbb{N}$, we use for simplicity the expression $s + \{r\}$ to
indicate the set of values $\{s + i \mid i \in \{r\}\}$. The cardinality of the
set $\{r\}$ can be computed in constant time by using the specialized SIMD
instruction _mm_popcnt_u32(r), which counts the number of bits of the
parameter r that are set to 1. This instruction has a 1-cycle throughput and a
3-cycle latency.



In contrast, the values in $\{r\}$ can be listed in $O(\alpha)$ time and $O(1)$
space or, using a tabulation approach, in $O(|\{r\}|)$ time and $O(2^\alpha)$
space. In the latter case we need an $O(\alpha 2^\alpha)$-time preprocessing
phase in order to address the $2^\alpha$ possible registers.
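
A minimal C sketch of the two operations on $\{r\}$ follows, for
$\alpha = 16$: counting the set bits (a portable counterpart of the population
count instruction) and listing the positions $s + \{r\}$ with the simple
$O(\alpha)$-time, $O(1)$-space scan. The function names count_bits and
list_positions are hypothetical, and bit i is taken to be the i-th least
significant bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bits set in r, i.e. the cardinality of {r}. */
size_t count_bits(uint16_t r)
{
    size_t c = 0;
    while (r != 0) {
        r = (uint16_t)(r & (r - 1));  /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Writes s + i to out for every set bit i of r, in increasing order.
 * out must have room for 16 entries; returns how many were written. */
size_t list_positions(uint16_t r, size_t s, size_t *out)
{
    size_t k = 0;
    assert(out != NULL);
    for (unsigned i = 0; i < 16; i++)
        if ((r >> i) & 1u)
            out[k++] = s + i;
    return k;
}
```

The tabulated variant mentioned above would instead precompute the list for
each of the $2^\alpha$ register values and look it up directly.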

We are now ready to describe the three auxiliary algorithms used in the 
EPSM algorithm. The pseudocode of the three algorithms is shown in Fig. 1. 

3.2 EPSMa: Searching for Very Short Patterns 

The EPSMa algorithm is designed to be extremely fast in the case of very short
patterns; although it could be adapted to work for longer patterns, its
performance degrades as the length of the pattern increases. In practice the
EPSM algorithm uses this procedure when $0 < m < 4$. The pseudocode of the
algorithm is shown in Fig. 1.

The preprocessing of the algorithm (lines 1-4) is computed on the prefix of
the pattern of length $m' = \min\{m, \alpha/2\}$. If $m' = m$ the whole pattern
is preprocessed and searched; otherwise the algorithm works as a filter,
searching for all occurrences of the prefix of length $m'$ and, after an
occurrence has been found, naively checking the whole occurrence of the
pattern.

Specifically, the preprocessing phase consists in constructing an array B of
$m'$ different strings of length $\alpha$. Each string of the array exactly
fits in a word of w bits. The i-th string in the array B consists of $\alpha$
copies of the character $p_i$. More formally, the string $B[i]$, for
$0 \le i < m'$, is defined as $B[i] = (p_i)^\alpha$. For instance, if p = ab is
a pattern of length $m = 2$, $\gamma = 8$ and $w = 128$, then B consists of two
strings of length $\alpha = 16$, defined as $B[0] = \mathrm{a}^{16}$ and
$B[1] = \mathrm{b}^{16}$. The preprocessing phase of the algorithm requires
$O(\min\{m, \alpha/2\} \cdot \alpha)$ time and $O(\min\{m, \alpha/2\})$ space.

The searching phase of the algorithm (lines 5-14) processes the text t in
chunks of $\alpha$ characters. Let $N = \lceil n/\alpha \rceil - 1$ and let
$T = T_0 T_1 \ldots T_N$ be the string t represented in chunks of $\alpha$
characters. Each block of the text, $T_i$, is compared with the strings in the
array B using the instruction wscmp.

Let $s_j = b_0 b_1 \ldots b_{\alpha-1}$ be the $\alpha$-bit register returned
by the instruction wscmp($T_i$, $B[j]$), for $0 \le j < m'$. It can be easily
proved that $b_k = 1$ if and only if the k-th character of the block $T_i$ is
equal to $p_j$, i.e. if and only if $T_i[k] = p_j$ (remember that
$B[j] = (p_j)^\alpha$). Finally, let $r = r_0 r_1 \ldots r_{\alpha-1}$ be the
$\alpha$-bit register defined as
$r = s_0 \,\&\, (s_1 \ll 1) \,\&\, (s_2 \ll 2) \,\&\, \cdots \,\&\, (s_{m'-1} \ll (m'-1))$.

It is easy to prove that $p[0 \,..\, m'-1]$ has an occurrence beginning at
position j of $T_i$ if and only if $r_j = 1$. In fact $r_j = 1$ only if
$s_k[j+k] = 1$, for $k = 0 \ldots m'-1$, which implies that
$T_i[j+k] = p_k$, for $k = 0 \ldots m'-1$. Then, if $m = m'$ the algorithm
reports the occurrences of the pattern at positions $i\alpha + \{r\}$, if any.
Otherwise we know that occurrences of the prefix of the pattern of length
$\alpha/2$ begin at positions $i\alpha + \{r\}$. Thus the algorithm checks the
occurrences beginning at those positions.

If we maintain, for each value r with $0 \le r < 2^\alpha$, a list of the
values in the set $\{r\}$, the naive check of the occurrences can be done in
$O(|\{r\}| m)$ time. When $m = m'$ the occurrences can be reported in
$O(|\{r\}|)$ time. Finally, observe that the $m' - 1$ possible occurrences
crossing the blocks $T_i$ and $T_{i+1}$ are naively checked by the algorithm
(lines 13-14).

The overall time complexity of the EPSMa algorithm is $O(nm)$, because in the
worst case a naive check is required for each position of the text. However,
when $m \le \alpha/2$ the EPSMa algorithm achieves an $O(n + occ)$ time
complexity, where occ is the number of occurrences of p in t.
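
The search scheme of EPSMa can be sketched in portable C as follows, for
$\alpha = 16$ and patterns with $m \le \alpha/2$. This is a scalar model under
stated assumptions, not the authors' implementation: the hypothetical helper
cmp_block stands in for wscmp against $B[j]$ (computed on the fly instead of
being preprocessed), the function only counts occurrences, and masks are
stored with bit k as the k-th least significant bit, so the left shift of the
description appears here as a right shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ALPHA = 16 };

/* Scalar stand-in for wscmp(T_i, B[j]): bit k is 1 iff block[k] == c. */
static uint16_t cmp_block(const unsigned char *block, unsigned char c)
{
    uint16_t s = 0;
    for (int k = 0; k < ALPHA; k++)
        if (block[k] == c)
            s |= (uint16_t)(1u << k);
    return s;
}

/* Counts occurrences of p[0..m-1] in t[0..n-1], for 1 <= m <= ALPHA / 2. */
size_t epsma_count(const unsigned char *t, size_t n,
                   const unsigned char *p, size_t m)
{
    size_t count = 0, i = 0;

    assert(m >= 1 && m <= ALPHA / 2);
    for (; i + ALPHA <= n; i += ALPHA) {
        uint16_t r = 0xFFFF;
        for (size_t j = 0; j < m; j++)      /* align s_j with position 0 */
            r &= (uint16_t)(cmp_block(t + i, p[j]) >> j);
        for (int k = 0; k < ALPHA; k++)     /* occurrences inside the block */
            count += (r >> k) & 1u;
        /* the m - 1 candidates crossing into the next block: naive check */
        for (size_t s = i + ALPHA - m + 1; s < i + ALPHA && s + m <= n; s++)
            if (memcmp(t + s, p, m) == 0)
                count++;
    }
    for (; i + m <= n; i++)                 /* tail shorter than one block */
        if (memcmp(t + i, p, m) == 0)
            count++;
    return count;
}
```

Each full block costs m compare operations plus m - 1 naive checks at the
block boundary, which matches the linear behavior claimed for short patterns.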

3.3 EPSMb: Searching for Short Patterns 

The EPSMb algorithm searches for the whole pattern when its length is less
than or equal to $\alpha/2$ and works as a filter algorithm for longer
patterns. However, it is based on a more efficient filtering technique and
turns out to be faster in the second case. In practice the EPSM algorithm uses
this procedure when $m \ge 4$. The pseudocode of the EPSMb algorithm is shown
in Fig. 1 (in the middle). Notice that no preprocessing phase is needed by the
algorithm.

Let $m'$ be the minimum between $\alpha/2$ and m. Moreover, let $p'$ be the
prefix of p of length $m'$. The searching phase of the algorithm (lines 3-14)
processes the text t in chunks of $\alpha$ characters. Let
$N = \lceil n/\alpha \rceil - 1$ and let $T = T_0 T_1 \ldots T_N$ be the string
t represented in chunks of $\alpha$ characters. Each block of the text, $T_i$,
is searched in turn for occurrences of the string $p'$ using the instruction
wsmatch.

Specifically, let $r = r_0 r_1 \ldots r_{\alpha-1}$ be the $\alpha$-bit
register returned by the instruction wsmatch($T_i$, $p'$), for
$0 \le i \le N$. We have that $r_j = 1$ if and only if an occurrence of $p'$
begins at position j of the block $T_i$, for $0 \le j < \alpha/2$. Then, if
$m' = m$ (and hence $p = p'$) the algorithm simply returns the positions
$i\alpha + j$ such that $r_j = 1$. Otherwise, if $m' < m$, the algorithm
naively checks for whole occurrences of the pattern starting at the positions
$i\alpha + j$ such that $r_j = 1$.

Notice that packed string matching instructions generally allow reading only
blocks $T_i$ of $\alpha$ characters (128 bits in the case of SSE
instructions), where $T_i = t[i\alpha \,..\, (i+1)\alpha - 1]$. Occurrences of
the pattern beginning in the second half of the block $T_i$ are checked
separately. In particular a new block, S, obtained by applying the instruction
wsblend($T_i$, $T_{i+1}$), is processed in the same way as block $T_i$. In this
case we report all occurrences of the pattern beginning at positions
$i\alpha + \alpha/2 + j$, with $0 \le j < \alpha/2$. One may ask why blending
is used instead of simply shifting the window. The reason is that the SSE
instructions used in this context require the operands to be 16-byte aligned
in memory, and performance degrades significantly otherwise. Thus, blending is
more advantageous.

The resulting algorithm has an $O(nm)$ worst-case time complexity and requires
$O(1)$ additional space. When $m \le \alpha/2$ the algorithm reaches the
optimal $O(n/\alpha + occ)$ worst-case time complexity.
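
The two-pass structure of EPSMb (first half of $T_i$, then the blended block
S) can be modelled in portable C as below, for $\alpha = 16$ and
$m \le \alpha/2$. This is only a scalar sketch: the hypothetical helper
match_block stands in for wsmatch, the blend is written as two memcpy calls,
and the function counts occurrences rather than reporting them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ALPHA = 16, HALF = ALPHA / 2 };

/* Scalar stand-in for wsmatch: bit j (0 <= j < HALF) is 1 iff p[0..m-1]
 * occurs in the ALPHA-byte block at offset j. Requires m <= HALF. */
static unsigned match_block(const unsigned char *block,
                            const unsigned char *p, size_t m)
{
    unsigned r = 0;
    for (int j = 0; j < HALF; j++)
        if (memcmp(block + j, p, m) == 0)
            r |= 1u << j;
    return r;
}

static size_t count_ones(unsigned r)
{
    size_t c = 0;
    for (; r != 0; r &= r - 1)
        c++;
    return c;
}

/* Counts occurrences of p[0..m-1] in t[0..n-1], for 1 <= m <= HALF. */
size_t epsmb_count(const unsigned char *t, size_t n,
                   const unsigned char *p, size_t m)
{
    unsigned char s[ALPHA];
    size_t count = 0, i = 0;

    assert(m >= 1 && m <= HALF);
    /* the main loop needs both T_i and T_{i+1} to lie inside the text */
    for (; i + 2 * ALPHA <= n; i += ALPHA) {
        count += count_ones(match_block(t + i, p, m)); /* starts in 1st half */
        memcpy(s, t + i + HALF, HALF);                 /* S = wsblend(T_i, T_{i+1}) */
        memcpy(s + HALF, t + i + ALPHA, HALF);
        count += count_ones(match_block(s, p, m));     /* starts in 2nd half */
    }
    for (; i + m <= n; i++)                            /* naive check of the tail */
        if (memcmp(t + i, p, m) == 0)
            count++;
    return count;
}
```

In this scalar model S is simply the window shifted by $\alpha/2$; the blend
matters in the SIMD setting, where it avoids an unaligned load.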

3.4 EPSMc: Searching for Medium Length Patterns 

The EPSMc algorithm is designed to be faster for medium length patterns. It is
based on a simple filtering method and uses a hash function for computing
fingerprint values on blocks of $\alpha$ characters. The fingerprint values are
computed by using a hash function
$h : \Sigma^\alpha \to \{0, 1, \ldots, 2^k - 1\}$, for a constant parameter k
(in practice we chose $k = 11$).

The function h is computed in a very fast way by using the wscrc specialized
instruction, and in particular
$h(x) = \mathrm{wscrc}(x) \,\&\, 0^{\alpha-k} 1^k$, for each
$x \in \Sigma^\alpha$.

The pseudocode of the EPSMc algorithm is shown in Fig. 1.

EPSMa(p, m, t, n)
 1. m' ← min{m, α/2}
 2. for i ← 0 to m' - 1 do
 3.     for j ← 0 to α - 1 do
 4.         B_i[j] ← p[i]
 5. for i ← 0 to (n/α) - 1 do
 6.     r ← 1^α
 7.     for j ← 0 to m' - 1 do
 8.         s_j ← wscmp(T_i, B_j)
 9.         r ← r & (s_j ≪ j)
10.     if m = m'
11.         then report occurrences at iα + {r}
12.         else check positions iα + {r}
13.     for j ← 0 to m - 2 do
14.         check position (i + 1)α - j

EPSMb(p, m, t, n)
 1. m' ← min{m, α/2}
 2. p' ← p[0 .. m' - 1]
 3. for i ← 0 to (n/α) - 1 do
 4.     r ← wsmatch(T_i, p')
 5.     if r ≠ 0^α then
 6.         if m = m'
 7.             then report occurrences at iα + {r}
 8.             else check positions iα + {r}
 9.     S ← wsblend(T_i, T_{i+1})
10.     r ← wsmatch(S, p')
11.     if r ≠ 0^α then
12.         if m = m'
13.             then report occurrences at iα + α/2 + {r}
14.             else check positions iα + α/2 + {r}

EPSMc(p, m, t, n)
 1. mask ← 0^(α-k) 1^k
 2. for i ← 1 to m - α do
 3.     v ← wscrc(p[i .. i + α - 1])
 4.     v ← v & mask
 5.     L[v] ← L[v] ∪ {i}
 6. sh ← (⌊m/α⌋ - 1) · α
 7. for i ← 0 to (n/α) - 1 do
 8.     v ← wscrc(T_i)
 9.     v ← v & mask
10.     for all j ∈ L[v] do
11.         if 0 ≤ i - j < n - m
12.             then check position i - j
13.     i ← i + sh

Fig. 1. The EPSMa (on the top), the EPSMb (in the middle) and the EPSMc (on the
bottom) auxiliary algorithms.
During the preprocessing phase (lines 1-6) a fingerprint value of k bits is
computed for all substrings of the pattern of length $\alpha$. Then a table L
of size $2^k$ is computed in order to store the starting positions of all
substrings of the pattern, indexed by their fingerprint values. In particular
we have $L[v] = \{i \mid h(p[i \,..\, i+\alpha-1]) = v\}$, for all
$0 \le v < 2^k$.

The preprocessing phase of the EPSMc algorithm takes $O(m + 2^k)$ time.

Let $N = \lceil n/\alpha \rceil - 1$ and let $T = T_0 T_1 \ldots T_N$ be the
string t represented in chunks of $\alpha$ characters. During the searching
phase (lines 7-13) the EPSMc algorithm inspects the blocks of the text in
steps of $(\lfloor m/\alpha \rfloor - 1) \cdot \alpha$ positions. For each
inspected block $T_i$ the fingerprint value $h(T_i)$ is computed and all
positions in the set $\{i\alpha - j \mid j \in L[h(T_i)]\}$ are naively
checked. The EPSMc algorithm has an $O(nm)$ worst-case time complexity but
turns out to be very effective in practice.
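
The fingerprint filter can be sketched in portable C as follows, for
$\alpha = 16$ and $k = 11$. Several choices here are assumptions of the sketch
rather than details of EPSMc: an FNV-1a hash replaces the CRC instruction, the
table L is stored as linked lists in two integer arrays, the pattern is
required to satisfy $m \ge 2\alpha$ so that the step between inspected blocks
is positive, and only the first sh offsets of the pattern are registered so
that every occurrence is verified exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { ALPHA = 16, KBITS = 11, TABLE = 1 << KBITS };

/* Stand-in for wscrc(x) & mask: FNV-1a over one block, cut to KBITS bits. */
static unsigned fingerprint(const unsigned char *block)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < ALPHA; i++) {
        h ^= block[i];
        h *= 16777619u;
    }
    return (unsigned)(h & (TABLE - 1));
}

/* Counts occurrences of p[0..m-1] in t[0..n-1]; assumes m >= 2 * ALPHA. */
size_t epsmc_count(const unsigned char *t, size_t n,
                   const unsigned char *p, size_t m)
{
    int head[TABLE];              /* head[v]: first offset with fingerprint v */
    int *next;                    /* next[j]: following offset in the same list */
    size_t sh, count = 0;

    assert(m >= 2 * ALPHA);
    sh = (m / ALPHA - 1) * ALPHA; /* distance between inspected blocks */
    next = malloc(sh * sizeof *next);
    if (next == NULL)
        abort();

    for (int v = 0; v < TABLE; v++)
        head[v] = -1;
    for (size_t j = 0; j < sh; j++) {              /* preprocessing */
        unsigned v = fingerprint(p + j);
        next[j] = head[v];
        head[v] = (int)j;
    }
    for (size_t b = 0; b + ALPHA <= n; b += sh)    /* searching */
        for (int j = head[fingerprint(t + b)]; j >= 0; j = next[j])
            if ((size_t)j <= b && b - (size_t)j + m <= n &&
                memcmp(t + b - j, p, m) == 0)      /* naive verification */
                count++;
    free(next);
    return count;
}
```

Because consecutive inspected blocks are sh positions apart and the pattern
spans at least sh + $\alpha$ characters, every occurrence fully contains one
inspected block whose offset inside the pattern is below sh.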

4 Experimental Results 

In this section we present experimental results in order to compare the
performance of our newly presented algorithms against the best solutions known
in the literature in the case of short patterns. We consider all the fastest
algorithms in the case of short patterns as listed in a recent experimental
evaluation by Faro and Lecroq [16, 14]. In particular we compared the following
algorithms:

- the Hash algorithm using groups of q characters [23] (HASHq);

- the Extended Backward Oracle Matching algorithm [11, 13] (EBOM);

- the TVSBS algorithm [27] (TVSBS);

- the Shift-Or algorithm [1] (SO);

- the Shift-Or algorithm with q-grams [8] (UFNDMq);

- the Fast-Average-Optimal-Shift-Or algorithm [18] (FAOSOq);

- the Backward DAWG Matching algorithm using q-grams [8] (BNDMq);

- the Simplified BNDMq algorithm [8] (SBNDMq);

- the Forward BNDMq algorithm [11, 13, 25] (FBNDMq);

- the Crochemore-Perrin algorithm using SSE instructions [3] (SSECP);

- the EPSM algorithm presented in this paper.

We recall that the EPSM algorithm consists of the EPSMa algorithm when
$m < 4$, of the EPSMb algorithm when $4 \le m < 16$, and of the EPSMc algorithm
when $m \ge 16$.

In the case of algorithms making use of q-grams, the value of q ranges in the
set {2, 4, 6}. All algorithms have been implemented in the C programming
language and have been tested using the Smart tool [15] for exact string
matching. The experiments were executed locally on a machine running Ubuntu
11.10 (oneiric) with an Intel i7-2600 processor and 16GB of memory. Algorithms
have been compared in terms of running times, including any preprocessing
time. For the evaluation we used a genome sequence, a protein sequence and a
natural language text (English language), all sequences of 4MB. The sequences
are provided by the Smart research tool. For each input file, we have searched
sets of 1000 patterns of fixed length m randomly extracted from the text, for
m ranging from 2 to 32 (short patterns). Then, the mean of the running times
has been reported.

Table 1. Experimental results for searching 1000 patterns on a genome sequence.

Table 2. Experimental results for searching 1000 patterns on a protein sequence.

Table 1, Table 2 and Table 3 show the experimental results obtained for a
genome sequence, a protein sequence and a natural language text, respectively.

In the case of algorithms using q-grams we have reported only the best result
obtained by their variants. The values of q which obtained the best running
times are reported as superscripts. Running times are expressed in hundredths
of seconds; best results have been boldfaced and underlined, while the second
best results have been boldfaced.

The experimental results show that the EPSM algorithm mostly achieves the best
performance for short patterns. When searching on a genome sequence it is
second only to the BNDMq algorithm for $12 \le m \le 14$ and to the SSECP
algorithm when $m = 6$. Observe however that the EPSM algorithm is (up to 2
times) faster than the SSECP algorithm in most cases.
Table 3. Experimental results for searching 1000 patterns on a natural language text.



When searching on a natural language text the EPSM algorithm obtains in
most cases the best results, and is second to BNDM-based algorithms only for
20 ≤ m ≤ 22.

For increasing lengths of the pattern the performance of the EPSM algorithm
remains stable, underlining a linear trend on average. However, the performance
of the other algorithms, which are based on shift heuristics, improves slightly. This is
more evident when searching on a protein sequence, where the algorithms based
on bit-parallelism and q-grams turn out to be the fastest solutions for longer
patterns. However, in these latter cases the EPSM algorithm is always very close
to the best solutions.

It is interesting to observe that the EPSM algorithm is faster than the SSECP
algorithm in almost all cases, and the gap is more evident in the case of longer
patterns. In fact, despite its optimal worst-case time complexity, the SSECP
algorithm shows running times that increase with m on average, while the EPSM
algorithm shows a linear behavior.



5 Conclusions 



We presented a new packed exact string matching algorithm based on the Intel
streaming SIMD extensions technology. The presented algorithm, named EPSM,
is based on three auxiliary algorithms which are used when 0 < m < 4, m ≥ 4,
and m ≥ 16, respectively. Despite its O(nm) worst-case time complexity, the
resulting algorithm turns out to be very fast in the case of very short patterns.
From our experimental results it turns out that the EPSM algorithm is in general
the best solution when m ≤ 32. It would be interesting to investigate the
possibility of improving the performance of packed string matching algorithms by
introducing shift heuristics.
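The selection among the three auxiliary algorithms by pattern length can be sketched in C as follows. This is a minimal illustration under stated assumptions: the thresholds are read as m < 4, 4 ≤ m < 16 and m ≥ 16, the names select_variant, dispatch_count and placeholder_count are hypothetical, and a naive matcher stands in for each of the SSE-based auxiliary algorithms, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*matcher_fn)(const unsigned char *p, size_t m,
                             const unsigned char *t, size_t n);

/* Stand-in (hypothetical) for each auxiliary algorithm: a naive
   O(nm) occurrence counter. */
static size_t placeholder_count(const unsigned char *p, size_t m,
                                const unsigned char *t, size_t n)
{
    size_t count = 0;
    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++)
        if (memcmp(t + i, p, m) == 0)
            count++;
    return count;
}

/* One label per auxiliary algorithm (names are illustrative only). */
enum variant { VARIANT_TINY = 0, VARIANT_SHORT = 1, VARIANT_MEDIUM = 2 };

/* Assumed thresholds: m < 4, 4 <= m < 16, m >= 16. */
static enum variant select_variant(size_t m)
{
    assert(m > 0);
    if (m < 4)
        return VARIANT_TINY;
    if (m < 16)
        return VARIANT_SHORT;
    return VARIANT_MEDIUM;
}

/* Routes the search to the routine chosen for the pattern length. */
static size_t dispatch_count(const unsigned char *p, size_t m,
                             const unsigned char *t, size_t n)
{
    static const matcher_fn table[3] = {
        placeholder_count, placeholder_count, placeholder_count
    };
    return table[select_variant(m)](p, m, t, n);
}
```

Keeping the choice in a small table makes it easy to replace each placeholder with a specialized routine without touching the callers.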






References 



1. R. Baeza-Yates and G.H. Gonnet. A new approach to text searching. Comm. of
the ACM, 35(10):74-82, 1992. 

2. D. Belazzougui. Worst case efficient single and multiple string matching in the
RAM model. In Proceedings of the 21st International Workshop On Combinatorial 
Algorithms (IWOCA), pages 90-102, 2010. 

3. O. Ben-Kiki, P. Bille, D. Breslauer, L. Gasieniec, R. Grossi, and O. Weimann. 
Optimal packed string matching. In IARCS Annual Conf. on Found. of Software
Technology and Theoretical Computer Science, pages 423-432, 2011. 

4. P. Bille. Fast searching in packed strings. Journal of Discrete Algorithms, 9(1):49- 
56, 2011. 

5. D. Cantone, S. Faro, and E. Giaquinta. A compact representation of nondeter- 
ministic (suffix) automata for the bit-parallel approach. In Combinatorial Pattern 
Matching, pages 288-298, 2010. 

6. C. Charras and T. Lecroq. Handbook of exact string matching algorithms. King's 
College, 2004. 

7. M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, 
and W. Rytter. Speeding up two string-matching algorithms. Algorithmica, 
12(4):247-267, 1994. 

8. B. Durian, J. Holub, H. Peltola, and J. Tarhio. Tuning BNDM with q-grams. In
Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX),
pages 29-37, 2009. 

9. B. Durian, H. Peltola, L. Salmela, and J. Tarhio. Bit-parallel search algorithms for 
long patterns. In Intern. Symp. on Experimental Algorithms, pages 129-140, 2010.

10. S. Faro and M. O. Külekci. Fast multiple string matching using streaming SIMD
extensions technology. In Proceedings of the 19th International Symposium on String
Processing and Information Retrieval, volume 7608 of Lecture Notes in Computer 
Science, pages 217-228. Springer Berlin, 2012. 

11. S. Faro and T. Lecroq. Efficient variants of the backward-oracle-matching algo- 
rithm. In Proc. of the Prague Stringology Conference 2008, pages 146-160, Czech 
Technical University in Prague, Czech Republic, 2008. 

12. S. Faro and T. Lecroq. An efficient matching algorithm for encoded DNA sequences
and binary strings. In Proceedings of the 20th Annual Symposium on Combinatorial 
Pattern Matching, CPM '09, pages 106-115, Berlin, Heidelberg, 2009. Springer- 
Verlag. 

13. S. Faro and T. Lecroq. Efficient variants of the backward-oracle-matching algo- 
rithm. Int. J. Found. Comput. Sci., 20(6):967-984, 2009. 

14. S. Faro and T. Lecroq. The exact string matching problem: a comprehensive 
experimental evaluation. Arxiv preprint arXiv:1012.2547, 2010. 

15. S. Faro and T. Lecroq. Smart: a string matching algorithm research tool. University 
of Catania and University of Rouen, http://www.dmi.unict.it/~faro/smart/, 2011.

16. S. Faro and T. Lecroq. The exact online string matching problem: a review of the 
most recent results. ACM Computing Surveys, 45(2):to appear, 2013. 

17. K. Fredriksson. Faster string matching with super-alphabets. In String Processing 
and Information Retrieval, pages 207-214. Springer, 2002. 

18. K. Fredriksson and S. Grabowski. Practical and optimal string matching. In 
M. P. Consens and G. Navarro, editors, SPIRE, volume 3772 of Lecture Notes in 
Computer Science, pages 376-387. Springer-Verlag, Berlin, 2005.






19. Intel. Intel(R) 64 and IA-32 Architectures Optimization Reference Manual. Intel
Corporation, 2011.

20. D.E. Knuth, J.H. Morris Jr, and V.R. Pratt. Fast pattern matching in strings. 
SIAM Journal on Computing, 6:323, 1977.

21. M.O. Külekci. Filter based fast matching of long patterns by using SIMD
instructions. In Proc. of the Prague Stringology Conference, pages 118-128, 2009.

22. M.O. Külekci. Blim: A new bit-parallel pattern matching algorithm overcoming
computer word size limitation. Mathem. in Computer Science, 3(4):407-420, 2010. 

23. T. Lecroq. Fast exact string matching algorithms. Information Processing Letters, 
102(6):229-235, 2007. 

24. G. Navarro and M. Raffinot. A bit-parallel approach to suffix automata: Fast 
extended string matching. In Combinatorial Pattern Matching, pages 14-33, 1998. 

25. H. Peltola and J. Tarhio. Variations of forward-SBNDM. In Jan Holub and Jan
Zdarek, editors, Proceedings of the Prague Stringology Conference 2011, pages 3-14,
Czech Technical University in Prague, Czech Republic, 2011.

26. J. Rautio, J. Tanninen, and J. Tarhio. String matching with stopper encoding and 
code splitting. In Proceedings of the 13th Annual Symposium on Combinatorial 
Pattern Matching, CPM '02, pages 42-52, London, UK, 2002. Springer-Verlag.

27. R. Thathoo, A. Virmani, S. Sai Lakshmi, N. Balakrishnan, and K. Sekar. TVSBS: 
A fast exact pattern matching algorithm for biological sequences. J. Indian Acad. 
Sci., Current Sci., 91(1):47-53, 2006.

28. A.C. Yao. The complexity of pattern matching for a random string. SIAM J. 
Comput., 8(3):368-387, 1979. 






